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Abstract 



We address a frequently asked question on the covariance fitting of the highly correlated 
data such as our Bk data based on the SU(2) staggered chiral perturbation theory. Ba- 
sically, the essence of the problem is that we do not have an accurate fitting function 
enough to fit extremely precise data. When eigenvalues of the covariance matrix are 
small, even a tiny error of fitting function yields large chi-square and spoils the fitting 
procedure. We have applied a number of prescriptions available in the market such as 
the cut-off method, modified covariance matrix method, and Bayesian method. We also 
propose a brand new method, the eigenmode shift method which allows a full covariance 
' fitting without modifying the covariance matrix at all. In our case, the eigenmode shift 

(ES) method and Bayesian method turn out to be the best prescription to the problem. 
We also provide a pedagogical example of data analysis in which the diagonal approx- 
imation and the cut-off method fail in fitting manifestly, but the ES method and the 
Bayesian approach work well. 
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1. Introduction 

We have reported results of Bk calculated using improved staggered fermions with 
Nf = 2 + 1 flavors in Ref. [10]. We refer to Ref. and @ as SW-1 and SW-2 afterwards. 
In SW-1, we use three different lattice spacings to control the discretization errors. The 
dominant error in this result comes from uncertainty in the matching factors. One hidden 
uncertainty is that we use the diagonal approximation (uncorrelated fitting) instead of 
the full covariance fitting in SW-1. In fact, one of the most frequently asked questions 
on SW-1 is why we do the uncorrelated fitting (i.e. using the diagonal approximation) 
instead of the full covariance fitting. Here, we would like to address this issue on the 
covariance fitting and the diagonal approximation. 
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A significant difficulty in fitting the highly correlated data has been pointed out in 
the literature such as Refs. 0-0] ■ In the literature, they have proposed a number of 
prescriptions such as diagonal approximation 0, modifying the covariance matrix 7|, 
and cut-off method in the popular name of singular value decomposition (SVD) 0-5|. 
The weakness of these approaches is that all of these methods try to modify the covariance 
matrix one way or another. Hence, we lose the true meaning of % 2 and we do not know 
the quality of fitting in this case. 

Therefore, we propose a new method, the eigenmode shift (ES) method, which does 
not modify the covariance matrix but only use our freedom to modify the fitting func- 
tional form based on the Bayesian method. It turns out that the ES method allows for 
a probability interpretation of quality of fitting based on the Bayesian \ 2 distribution^ 
An alternative approach is the orthodox Bayesian method. In this approach, we add 
higher order terms to the fitting function with proper constraints until it fits the data. 
This also turns out to be another good solution to the problem. 

The paper is organized as follows. In Sec. [21 we review the covariance fitting process 
and give a physical meaning of covariance matrix. In Sec. [31 we address the problem 
with small eigenvalues of covariance matrix. In Sec. [4l we list the possible solutions to 
the problem and discuss about the pros and cons. Here, we explain the ES method. In 
Sec. we give a theoretical background for the pdf (probability distribution function) 
of the eigenvalues of the covariance matrix. In Sec. [HI we close with some concluding 
remarks. 



2. Review of covariance fitting 

Let us consider N samples of unbiased estimates of quantity yt with i = 1,2,3, ... ,D. 
Here, the data set is {yi(n)\n = 1, 2, 3, ... , N}. Let us assume that the samples yi(n) 
are statistically independent in n for fixed i but are substantially correlated in i. For 
example, a similar situation occurs in lattice gauge theory calculations where there are N 
independent gauge configurations and Bk values with multiple choices of different quark 
mass pairs of m x (valence down quark mass) and m y (valence strange quark mass), 
which corresponds to D Green functions measured over the gauge configurations. An 
introduction to this subject is given in Ref. 

The fitting functional form suggested by the SU(2) staggered chiral perturbation 
theory (SChPT) is linear as follows, 

p 

f th (X) = J2c a F a (X) (1) 

a=l 

where c a are the low energy constants (LECs) and F a is a function of X which represents 
collectively Xp (pion squared mass of xx), Yp (pion squared mass of yy), and so on. 
The details on F a and X are given in SW-1. Here, we focus on the X-fit of 4X3Y-NNLO 
fitting of the SU(2) SChPT, which is explained in great detail in SW-1. In this fit, we 
have three LECs and so P = 3. 



1 This is different from the normal x 2 distribution, which assumes the uniform prior information. We 
will address this issue when we discuss on the Bayesian method. 
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We are interested in the probability distribution of the average yi of the data ydn). 

1 N 

V* = AT £ ( 2 ) 



N 

n=l 



We assume that the measured values of ?A have a normal distribution P(y) by the central 
limit theorem for the multivariate statistical analysis as follows, 



1 1 ° 

P(y) = - exp[-- (Si ~»i)( N r^Xfo - Mi)] , (3) 

where /Xj represents the true mean value of j/j, which is, in general, unknown and can be 
obtained as N — > oo, and 

Z = / [<fy] exp[-- £ (ft ~ W)(JV r^)fe - ^)] ■ 

Here, Ty is the true covariance matrix, which is, in general, unknown in our problems. 
The maximum likelihood estimator of Tij turns out to be the sample covariance matrix 
Sij defined as follows, 

N 



n=l 



Cy — jy Sij , (5) 



where Cy is the normalized sample covariance matrix. Here, note that the covariance 
matrix is a symmetric and positive definite matrix which has real and positive cigcnval- 
uefl We assume that our theorjjfl must describe the data well. Then, 

p 

fii -+Vi = fth(Xi) = ^ c a F a (Xi) . 

a=l 

In other words, we want to test whether describes the data reliably from the standpoint 
of statistics. In this procedure, we want to determine c a (LECs) to give the best fit. Here, 
the best fit is defined by minimizing the T 2 of the numerical results {y~i}, where the T 2 
is defined by 

D 

T 2 = Y,[m- Vi ][N S^lyj-Vj}. (6) 

i,3=l 



2 Here, the positive means that the eigenvalues of the covariance matrix cannot be negative. In other 
words, some of them may be zero and the rest are positive. 
3 Here, it means the SU(2) SChPT. 
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We notice that Yi — \/N[yi — Vi] is distributed according to J\f(p, T) and pi = y/N[fii — vi\. 
Here, we use the same notation as in Ref. [l(|. Then, note that (N—l)Sij is independently 
distributed as 

JV-l 

Yl Z i( n ) Z j( n ) 
n=l 

where Z(n) is distributed according to Af(Q,T). In this case, [T 2 /(N - 1)][(N — d)/d] is 
distributed as a non-central F distribution of Fd,N-d, which is defined in Ref. (lQ| . and 
its non-centrality parameter is 



Here, d is the degrees of freedom of the fitting, defined by d = D — P. In Ref. [lfj, it 
is proved that the limiting distribution of T 2 as N — > oo is the ^-distribution with P 
degrees of freedom if pi — v^. 

At this point, we have to minimize T 2 in order to determine the LECs: {c a }. Hence, 
we need to solve the following equation: 

dT 2 

We can rewrite this equation as follows, 

^ =2 £ d J^Cj [fth{ X j )-y j] =0 (8) 
*.i=i a 

p 

A ab c b = h a (9) 



6=1 



where 



D 



A ab = ]T F a {Xi)C^F h {Xj) (10a) 

h a = FaiXijCr^ . (10b) 

Here, note that the matrix A ab is a symmetric matrix. The solution can be obtained by 
simply solving the linear algebra. 

Ac = h -> c = A~ 1 h (11) 

So far, so good. 

One caveat is that the solution of Eq. (fTTj) exists only if the covariance matrix C 
is non-singular. In practice, even though the covariance matrix has only very small 
eigenvalues, it is enough to cause a very poor fitting. We will address this issue in the 
next sections. 
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parameter 


value 


sea quarks 


asqtad staggered fermions 


valence quarks 


HYP staggered fermions 


gluons 


Symanzik improved gluon action 


geometry 


20 3 x 64 


number of confs 


671 


number of meas 


9 per conf 


ami/am s 


0.01/0.05 


1/a 


1662 MeV 


Us 


0.3286 at fx= 1/a 


am x , am y 


0.005 x n (n = 1,2, 3,..., 10) 



Table 1: Parameters for the numerical study on the coarse (C3) ensemble, mj is the light sea quark 
mass, m s the strange sea quark mass, m x the light valence quark mass, and m y the strange valence 
quark mass. Here, "conf" and "meas" represent gauge configuration and measurement, respectively. 



/(*) 


memo 




ci + c 2 X 

a + c 2 x + c 3 x 2 
MX) 


trial 
trial 
SChPT 


1.47(73) 12.6(50) 
0.19(13) 8.5(59) 
0.16(12) 7.2(54) 



Table 2: List of fitting functions and its quality. f(X) is the fitting function. The "memo" means 
the theoretical origin of the fitting functions. Xdiag represents T 2 with diagonal approximation and x 2 
represents T 2 with the full covariance matrix. 



3. Trouble with covariance matrix for Bk 

First, we address the issue on the quality of the fitting function form suggested by 
SChPT. Second, we would like to address a typical difficulty with a general covariance 
fitting of highly correlated data. To demonstrate the problem, we choose the Bk data on 
a coarse (C3) ensemble out of MILC asqtad lattices using the notation of SW-1 and S W- 
2. This ensemble is particularly a good sample, because it has relatively large statistics. 
The input parameters for the C3 ensemble is summarized in Table [T] 

3.1. Quality of the fitting function 

The fitting function given in Eq. (JTJ) is derived from SChPT. Hence, its reliability is 
directly related to the validity of SChPT. It is beyond the scope of this paper to discuss 
the validity of SChPT, which has been proved to be true through extensive numerical 
study in Ref. [J Il2l4l5j . Here, we take a different approach to the same issue. Basically, 
we take some trial fitting functions which do not have any solid theoretical background 
such as SChPT but are highly empirical. We choose two empirical fitting functions: one 
is linear and the other is quadratic in 

_ ml(xx;£ 5 ) 



where A = l.OGeV (a generic scale of chiral perturbation theory). The fitting results are 
summarized in Tabled From the table, we observe that the x 2 value for f t h is consistently 

5 



smallest, which indicates that our choice of the fitting function is quite optimal. 

We will get back to this issue when we discuss about the full covariance fitting and 
its trouble. 



3.2. Full covariance fitting 

As one can see in Table [TJ we have 55 combinations of m x and m y (10 degenerate 
pairs (m x — rn y ) and 45 non-degenerate pairs (m x ^ my)), and so D — 55 for this 
example. Since the number of gauge configurations is 671, N — 671. The fit type is 
the X-fit of the 4X3Y-NNLO fitting of the SU(2) SChPT as explained in SW-1, and so 
P = 3. In a single X-fit, we use only 4 data points and so we may say that D = 4. 
Now, let us consider the D x D covariance matrix. It has only 10 (= D(D + l)/2) 
components which can be determined completely in a linearly independent way, using 
671 independent configurations. Here, we focus on the troublesome small eigenvalues of 
the covariance matrix. 

To be concrete, let us walk through a specific example of Bk to demonstrate the 
problem and its consequence. In the X-fit, we fix am y = 0.05 and select 4 data points 
of am x = 0.005, 0.010, 0.015, 0.020 to fit to the functional form suggested by the SU(2) 
SChPT as in SW-1. Hence, the covariance matrix CV, is a 4 x 4 matrix. 



1.42, 0.661, 0.398, 0.274 

0.661, 0.392, 0.271, 0.204 

0.398, 0.271, 0.205, 0.165 

0.274, 0.204, 0.165, 0.138 



x 10" 



(12) 



Its eigenvalues are 



A, 



{ 1.95 x 10~ 5 , 1.92 x 10 -6 , 7.58 x 10~ 8 , 1.11 x 10~ 9 } . 



(13) 



The components of the matrix are between 1.42 x 10~ 5 and 1.38 x 10~ 6 . In the 
meanwhile, the smallest eigenvalue is smaller than the components by three orders of 
magnitude. 

Now let us look into the eigenvectors: 





' 0.837 " 




' -0.508 " 




0.202 " 




' -0.0378 " 




0.429 




0.387 




-0.739 




0.347 


Vl = 


0.276 


, v 2 = 


0.542 


, ^3 = 


0.0725 


, v 4 = 


-0.790 




0.200 




0.546 




0.639 




0.503 



(14) 



The eigenvector corresponds to the smallest eigenvalue. This eigenmode dominates 
the \ 2 fitting completely. Let us expand y in terms of eigenvectors as follows, 



V = ^2 aiVi ' 



(15) 



where is the eigenmode projection coefficient. Let us define a, as follows, 
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a, 



6 



i 


1 


2 


3 


4 


CLi 


1.021(4) 


0.5655(14) 


0.1061(3) 


0.01442(3) 


ati 


0.7589(18) 


0.2328(18) 


0.008190(69) 


0.0001513(12) 



Table 3: Eigenmode projection coefficients for y. 



Here, note that on represents the probability of the specific eigenmode in the data y. In 
Table |3l we show ai and a, for the data y. The eigenmode v\ and vi describes more than 
99% of the data y. The remaining 0.8% is occupied by w 3 and the rest 0.015% is from 
v 4 . 

We can rewrite the inverse covariance matrix as follows, 



k=l 



Hence, we can also express the T 2 as follows, 



4 ! 



(16) 



(17) 



where Ayj = [jjj — Here, note that the least T 2 fitting is completely dominated by V4, 
which becomes an inconvenient truth and causes a serious trouble in covariance fitting. 

In Figure [TJ we show the fitting results with the full covariance matrix. As one can 
see, the fitting curve does not pass through the data points. Hence, the quality of fitting 
looks poor to our eyes. The T 2 value is 

T 2 = 7.2 ± 5.4. 



The multivariate statistical theory predicts the following [If 

d 



8{T 2 ) 
V{T 2 ) 



(d + K) 
2(d + 2n 



1 



1 



{d + K) 2 

d+2n 



(18) 
(19) 



where £(T 2 ) and V(T 2 ) represent the expectation value and variance of the T 2 distri- 
bution. Here, d is the degrees of freedom and re is the non-centrality parameter. If the 
degrees of freedom is comparable to the number of samples (d rs N) , the leading devia- 
tion of the T 2 distribution from the \ 2 distribution becomes of order 0(1) and so we can 
not use the \ 2 distribution in this case, which is also pointed out in Ref. [1] in the con- 
text of distorted normal distribution. In Ref. [13] , the sample size effect is systematically 
explained in terms of 1 /N expansion. 

However, in our example of Bk, d = 1 and N = 671. Hence, the leading correction 
of the T 2 distribution to the x 2 distribution will be negligibly small (w 0.15%). In our 
example, the non-centrality parameter can be estimated by re w T 2 — d = 6.2. Then, we 
can obtain V(T 2 ) as follows, 



V(T 2 



2{d- 
7 



■2k) = 26.8 



(20) 



0.61 



0.6 



0.59 - 



m 



0.58 



0.57 



0.56 




Figure 1: B K {l/a) vs. X P on the C3 ensemble. The fit type is 4X3Y-NNLO in the SU(2) analysis. 
We fix am y = 0.05. The red line represents the results of fitting with the full covariance matrix. The 
red diamond corresponds to the Bk value obtained by extrapolating m x to the physical light valence 
quark mass after setting all the pion multiplet splittings to zero. 



Hence, the error of T 2 is supposed to be y/26.8 — 5.2, which is reasonably consistent 
with the measured value 5.4. The k is the non-centrality parameter which represents 
how much the fitting function deviates from the true mean values. In our example, 
k = 6.2 turns out to be a rather large value which comes from the fact that the small 
deviation of our fitting function from the true value due to the truncation of the higher 
order terms in the series expansion of the SU(2) SChPT can be amplified dramatically 
if there are small eigenvalues in the covariance matrix. 

Let us decompose the fitting function in terms of eigenmodes as follows, 



ft 



Ah = kvi 



(21) 
(22) 



To see how much the fitting function deviates from the data in a specific eigenmode, we 
define the difference, Si, as follow, 



(23) 



As summarized in Tabled the difference, Si, is 7.22 x 10~ 3 for V\, 2.40 x 10~ 3 for v%, 
3.28 x 10~ 4 for v$, whereas it is only 9.69 x 10~ 6 for v±. Hence, the procedure of the least 
X 2 fitting works hard for the coefficient of V4 but work less precisely for the coefficients 



i 


1 


2 


3 


4 


bi 


1.014(4) 


0.5679(11) 


0.1058(3) 


0.01443(3) 


a>i 


1.021(4) 


0.5655(14) 


0.1061(3) 


0.01442(3) 


10 5 • S, 


722(270) 


240(90) 


32.8(123) 


0.969(362) 


ft 


0.7548(10) 


0.2368(9) 


0.008212(69) 


0.0001528(11) 



Table 4: Eigenmode decomposition of / t h for the full covariance fitting. 



of v\ and v-i mainly because the eigenvalue A4 is significantly smaller than Ai and A2. 
The irony is that the data has only 0.015% overlap with V4 while more than 99% of 
it is dominated by v\ and t>2- In this sense, the failure of the full covariance fitting is 
obviously due to the fact that the least % 2 fitting tries to determine the coefficient of V4 
component of the data very precisely but lose precision in determining the coefficients 
of the v\ and vi components. As a consequence, the fitting curve misses the data points 
and the quality of fitting looks poor to our eyes. 

How can we improve the situation? We will address this issue in the next section. 

4. Prescription 

Here, we present a number of potential solutions to the problem raised in the previous 
section. Part of the solutions are well-known but vulnerable. Part of the solutions arc 
new but of noteworthy merit. We will go through them one by one, and discuss about 
the pros and cons. 

4-1. Diagonal approximation 

One simple solution to the problem is to use the diagonal approximation (uncorrelated 
fitting) Q. In this method, we neglect the off-diagonal covariance as follows: 

Cij =0 if i ^ j . (24) 

In this way, the small eigenvalue problem disappears and the fitting results are shown in 
Figure [H 

The fitting function / t h in the diagonal approximation is decomposed into eigenmodes 
of the full covariance matrix in Table [5] In this fit, the difference, Si, is 3.80 x 10~ 4 for 
Vi, 4.26 x 10~ 4 for v 2 , 5.23 x 10~ 4 for v 3 and 4.85 x 10~ 4 for u 4 . Here, note that the 
diagonal approximation method removes the small eigenvalues and so it takes all the 
eigenmodes, equally. As a result, the differences for all directions are less than or equal 
to 5.23 x 10~ 4 . Hence, the fitting looks quite reasonable to our eyes as one can see in 
Figure [2j 

In the diagonal approximation, we lose the physical meaning of x 2 , because we un- 
derestimate the x 2 by neglecting the off-diagonal terms in the covariance matrix. This 
point is a major drawback of the diagonal approximation. 

4-2. Cutoff method 

Another solution is to exclude the 1*4 eigenmode from the fitting. Since we know 
that the V4 eigenmode has least significant contribution in the data, we can think of 
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0.61 



0.6 - 



0.59 



pq 



0.58 



0.57 



0.56 




0.05 



0.1 
X p (GeV 2 ) 



0.15 



0.2 



Figure 2: %(l/a) vs. Xp on the C3 ensemble. All the parameters are the same as in Figurc[T] Here, 
the red line represents the results of the uncorrelatcd fitting using the diagonal approximation. 



i 


1 


2 


3 


4 




1.021(4) 


0.5659(13) 


0.1056(3) 


0.01490(18) 


a,i 


1.021(4) 


0.5655(14) 


0.1061(3) 


0.01442(3) 


10 5 • Si 


38.0(142) 


42.6(159) 


52.3(195) 


48.5(181) 


Pi 


0.7586(17) 


0.2332(16) 


0.008112(77) 


0.0001617(35) 



Table 5: Eigenmode decomposition of for the fitting with diagonal approximation. 
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0.61 I — i — i — i — i — | — i — i — i — i — | — i — i — i — i — | — i — i — i — i 



0.6 — \ cutoff method 




0.58 - -r 



<> 

0.57 - - 



0.56 I 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

0.05 0.1 0.15 0.2 

X p (GeV 2 ) 

Figure 3: %(l/n) vs. Xp on the C3 ensemble. All the parameters are the same as in Figure [2] Here, 
the red line represents the results of the covariance fitting after removing the smallest eigenvalue using 
the cutoff method. 



i 


1 


2 


3 


4 


h 


1.021(4) 


0.5655(14) 


0.1061(3) 


0.01524(30) 




1.021(4) 


0.5655(14) 


0.1061(3) 


0.01442(3) 


10 5 • St 


0(0) 


0(0) 


0(0) 


82.1(307) 


Pi 


0.7589(18) 


0.2328(18) 


0.008190(69) 


0.0001690(63) 



Table 6: Eigenmode decomposition of / t h for the fitting with the cutoff method. 



the philosophy of removing them by hand as suggested by Refs. [fl, [H|. A popular and 
systematic way of chopping away the eigenmode is to set up such a cutoff that we 
project out the eigenvectors of those eigenvalues smaller than the cutoff into a null space 
of the inverse covariance matrix . We call this the cutoff method. A number of 
lattice QCD groups [mil use this method in the popular name of the SVD (singular 
value decomposition) method. 

Now let us walk through an example to demonstrate how it works. In our example of 
Bk, we have three parameters to determine from the fit. Hence, it is possible to remove 
only one eigenmode out of the four by setting I/A4 = 0. In Figure [21 we show the 
results of the covariance fitting using the cutoff method. It is amusing to see how good 
it works. The results are consistent with those in Figure [2j 

In Table [5] we decompose the fitting function / tn obtained using the cutoff method 
into eigenmodes of the full covariance matrix. In this case, the difference, Si, is zero 
for i = 1,2,3 and is 8.21 x 10 -4 in V4. In this method, we do not care about the V4 
eigenmode at all. As a result, the fitting quality looks quite good to our eyes, which is 
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quite consistent with that of the diagonal approximation. 

However, a major drawback of the cutoff method is that we cannot give the physical 
meaning to the quality of fit, which is normally reflected in the minimized value of x 2 ■ 
In Appendix |Appcndix B.2| we show that the probability distribution of x 2 defined in 
cutoff method is the \ 2 distribution with degrees of freedom equal to D — P — R. Here, 
D is the number of data points, P is the number of fitting parameters and R is the 
number of removed eigenmodes. However, even though we know the distribution of the 
X 2 in cutoff method, we cannot measure the quality of fit through the minimized value 
of x 2 - This is because we remove some eigenmodes and the \ 2 m the cutoff method is 
orthogonal to the removed eigenmodes. As you know, the physical \ 2 has D — P degrees 
of freedom, while the \ 2 °f the cut-off method possesses D — P — R degrees of freedom. 
Unfortunately, the missing degrees of freedom, R are physical. 

The situation becomes even worse when we use the resampling method, such as 
jackknife method or bootstrap method. When the size of covariance matrix is large, we 
might lose a control over the number of the small eigenvalues that we remove. In other 
words, in one jackknife sample, we may remove two small eigenvalues and in another 
jackknife sample, we may remove three of them. During the procedure, the definition of 
X 2 is shifting from one to another. 

4-3. Modified covariance matrix 



One may take another approach to handle the small eigenmodes as in Ref. [22|. We 
first define the correlation matrix C as follows, 

*i = \fC~ l (25) 

Ca = ^- (26) 

such that the diagonal components of Cij is 1. In our example of Bk, it is 



Cij 



1.000 0.888 0.738 0.619 

0.888 1.000 0.955 0.877 

0.738 0.955 1.000 0.978 

0.619 0.877 0.978 1.000 



(27) 



Hence, we can say that the correlation matrix is a normalized covariance matrix. In our 
example, the off-diagonal terms are quite large between 0.6 and 1.0, which indicates that 
the data is highly correlated and moving together in the same direction. The eigenvalues 
of the correlation matrix are 

Xi = { 3.54, 0.437, 0.0249, 0.000521} (28) 

One may choose their cutoff as A cu t = 1 as in Ref. j22[. We remove by hand all the 
eigenmodes whose eigenvalue is smaller than A cu t . In our example, this corresponds to 
removing three eigenmodes of A2, A3, and A4. It is obvious that the remaining correlation 
matrix is highly singular. In order to avoid the singular behavior, we restore the diagonal 
components back to 1 by hand as in Ref. [22| ■ Let us call this modified correlation matrix 
Cij. Then, let us define the modified covariance matrix as 

Mij = dj x crj x Oj (29) 
12 



0.61 



0.6 



0.59 



m 



0.58 



0.57 



0.56 




0.05 



0.1 
X p (GeV 2 ) 



0.15 



0.2 



Figure 4: Bjf(l/a) vs. Xp on the C3 ensemble. All the parameters are the same as in Figure[2] Here, 
the red line represents the results of the covariance fitting using the modified covariance matrix. 



i 


1 


2 


3 


4 


h 


1.018(4) 


0.5665(12) 


0.1056(3) 


0.01473(12) 


at 


1.021(4) 


0.5655(14) 


0.1061(3) 


0.01442(3) 


10 5 • 8, 


289(108) 


103(38) 


48.9(182) 


30.9(116) 


ft 


0.7573(13) 


0.2344(13) 


0.008143(73) 


0.0001584(24) 



Table 7: Eigenmode decomposition of / t h for the fitting with the MCM method. 



In our example of Bk , it is 



1.42 
0.632 
0.453 
0.353 



0.632 
0.392 
0.275 
0.214 



0.453 
0.275 
0.205 
0.153 



0.353 
0.214 
0.153 
0.138 



x 10" 



(30) 



Then, we use My as their covariance matrix and fit the data. We call this the modified 
covariance matrix (MCM) methodQ 

In Figure|4j we show the results of the covariance fitting using the modified covariance 
matrix. The fitting quality is somewhere between the diagonal approximation and the 
full covariance fitting. 

In Table [7] we decompose the fitting function / t h obtained using the MCM method 
into eigenmodes of the full covariance matrix. In this case, the difference, Si, is 2.89 x 10 -3 



4 MILC does not use this method anymore in their fitting 23] . 
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for vi, 1.03 x 10~ 3 for v 2 , 4.89 x 10~ 4 for v 3 and 3.09 x 10~ 4 for v 4 . As a result, the fitting 
quality look mediocre to our eyes and worse than that of the diagonal approximation. 

The main disadvantage of using the MCM method is that the x 2 loses the physi- 
cal meaning completely, because the modified covariance matrix is significantly different 
from the full covariance matrix by construction. However, one may view the diagonal ap- 
proximation as a special case of the MCM method where all the eigenmodes are removed. 
In this sense, the MCM method is as bad as the diagonal approximation. 

4-- 4- Eigenmode shift method 

So far, all the methods are based on the philosophy that we may manipulate or modify 
the covariance matrix. This philosophy is very dangerous in a sense that the modification 
results in a missing information in physics. The missing information is highly physical. 
Hence, it is not a good idea to modify the covariance matrix. The only degrees of freedom 
that we have is to modify the fitting function, but not the covariance matrix. 

We know that the whole trouble comes from the inexact fitting function that has 
small error in V4 eigenmode direction. Hence, we can think of a new fitting function / t ' h 
defined as follows, 

/ t ' h (X) = / th (X)+^ 4 (31) 

Here, r\ is a tin y p arameter which can be determined using the Bayesian method. The 
Bayes theorem [24| says that 

P(A\B,I) oc P(B\A,I) x P(A\I) (32) 

Here, A represents our theoretical hypothesis, D is the data, and I corresponds to the 
background information. Note that P(H)\A,I) means the probability that we obtain the 
data set of D if A and / are given to us. We know the conditional likelihood function of 
P(P\A,I), which is nothing but 

P(B\A,I) cx expf—) ( 33 ) 
X 2 = Yfii-v&Cr+fo-i/A (34) 

v{ = /thPQ), (35) 



as explained in Ref . [24 1 



In addition, if we impose the maximum entropy principle [24| on the prior condition 
that a n — (jjj < 77 < a, q + a v , then the prior becomes the following: 

P(A\I) oc exp^-^j (36) 

Xprior = 5 

(37) 



Then, we obtain the posterior pdf as follows: 



P(A\B,I) « exp^-^pj (38) 

Xaug — X Xprior 

(39) 
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Figure 5: B^(l/a) vs. Xp on the C3 ensemble. All the parameters are the same as in Figure[2] Here, 
the red line represents the results of the eigenmode shift method. 

The Bayesian principle [24j | is to determine the fitting parameters by maximizing the 
posterior pdf: P(A|B, J). This is equivalent to minimizing the Xaug- 

Let us switch the gear to our choice of the prior condition. From the SChPT, the 
neglected highest order term in the fth(X) is 

X 2 (ln(X)) 2 « 0.006 

where X = Xp/A 2 s» 0.02. The only constraint on the coefficient C4 of this term is that 
C4 = ± 1, but we do not know the sign of C4. Hence, we set a v = and a v — 0.006. 

Then wc perform the full covariance fitting with the extra fitting parameter, 77, by 
minimizing x^ug (the- Bayesian principle) . The obtained 77 is 

77 = -0.00082(31) . 

When we do the extrapolation to the physical pion mass, we use only the fth(X) function, 
dropping out the 77 term, which is too small to make any difference at any rate. We call 
this the eigenmode shift (ES) method. 

In Figure[5l we show the fitting results obtained using the ES method. In our example, 
tiny correction proportional to 77 makes the fitting results that pass through the average 
data point. 

In Table [51 we show the eigenmode decomposition of /th when we use the ES method 
in fitting. As one can see in the table, the Si is smaller by order of magnitude compared 
with the diagonal approximation for i — 1,2,3. The only non-trivial component is #4, 
which is taken care of by the shift parameter 77. As a result, in Table |H the 84 for 
/ t ' h becomes negligibly small by the shift parameter 77. The success of the fitting can 
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i 


12 3 4 


bi 
a>i 
10 5 • S, 

ft 


1.021(4) 0.5655(14) 0.1061(3) 0.01524(30) 
1.021(4) 0.5655(14) 0.1061(3) 0.01442(3) 
1.88(70) 0.624(233) 0.0855(319) 81.9(306) 
0.7589(18) 0.2328(18) 0.008190(69) 0.0001690(63) 


Table 8: Eigenmode decomposition of f t ^ for the fitting with the ES method. 


i 


12 3 4 


h 

10 5 • s, 

ft 


1.021(4) 0.5655(14) 0.1061(3) 0.01442(3) 
1.021(4) 0.5655(14) 0.1061(3) 0.01442(3) 
1.88(70) 0.624(233) 0.0855(319) 0.00252(94) 
0.7589(18) 0.2328(18) 0.008190(69) 0.0001513(12) 



Table 9: Eigenmode decomposition of /', for the fitting with the ES method. 



be checked against the hypothesis that \rj\ < 0.006. The results of fitting say that 
t) = —0.00082(31), which is highly consistent with the hypothesis. In this sense, the 
Bayesian prior is quite reasonable and self-consistent. 

Let us turn to the issue of quality of fitting and physical interpretation of Xaug- I n a 
naive sense of physical interpretation, we may count the prior condition as one of data 
points, and we consider the shift parameter r\ as a new parameter in the fitting. Hence, 
in this interpretation, the effective number of data points is D = D + 1, and the effective 
number of unknown parameters of the fitting function is P = P + 1. Accordingly, the 
effective degrees of freedom becomes d = D — P = D — P = d. 

Let us redefine the notation for the eigenmodes as follows: 



V5 



for i = {1,2,3,4} 



(40) 



(41) 



and 



A 5 



A, for £ = {1,2,3,4} 



(42) 
(43) 



Then, we can rewrite the xtuz as follows: 



X 2 

A aug 



5 A 2 
a=l A ° 



(44) 



where <5 a is defined similarly as in Eq. (j2"3"]l . We can prove with ease that all the eigenvec- 
tors are orthogonal to each other. If we assume that we may count the prior condition 
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as one of the data points, we realize that the effective degrees of freedom have been in- 
creased by one as we observe in Eq. (|44l) . The number of unknown parameters in fitting 
is P. Hence, even in the strict sense of physical interpretation, the effective degrees of 
freedom is d = D — P. Therefore, the xiug must follow the normal x 2 distribution with 

Xaug ~ ^ ± \/2d = d ± \/2d, as explained in Appendix |Appendix B.3| 

However, in our Bayesian prior information, we did not use the statistical information 
for a v and a v , but the optimal range of possible value of r\ to set the a n and a^. (i.e. 
On 7^ £ (rj) and a v ^ ^/Viv))- Hence, the choice of a v and a v could be larger or smaller 
(overestimated or underestimated) than the statistical value of them. As a consequence, 
our estimate of Xaug could be smaller or larger than the normal value of Xaug ~ d. Hence, 
in this approach of Bayesian method, we cannot rely on the strict statistical interpretation 
of Xaug- O ne good news is that the probability interpretation is still possible with Xaug; 
which allows for the quality of fittingfl and the model selection on the basis of the Bayesian 
statistics [2J|. Hence, we can, at least, tell from xiug which model or hypothesis is more 
probable. This is an important point because it justifies the fitting procedure that finds 
the most probable fitting parameters by minimizing Xaug- 

In our example of Bk, the Xaug/dof f° r the ES method is 0.019(14). 
In the limit of a n — > oo, we can remove the prior condition completely, which we call 
unconstrained ES method. In Appendix |Appendix A[ we prove that the unconstrained 
ES method is equivalent to the cutoff method. However, the original ES method is quite 
different from the cutoff method in the following sense. The effective number of degrees 
of freedom for the ES method is d — D — P = 1 while that for the cutoff method is 
d = (D — 1) — P = 0. In addition, the ES method is rigorously based on the Bayesian 
method and is subject to the probability interpretation. However, the cutoff method does 
not allow for the probability interpretation mainly because it modifies the Hilbert space 
for the covariance matrix. In addition, in the ES method, we modify the fitting function 
by the shift parameter rj, which we can monitor and gives us an estimate of how much we 
are changing the fitting function. In the case of the cutoff method, we do not know how 
much of the fitting function we are dumping into the null space of the covariance matrix. 
In order to illustrate the difference between the ES method and the cutoff method, we 
provide a pedagogical and heuristic example in Appendix |Appendix C| in which the ES 
method works well, but the cutoff method and the diagonal approximation fail manifestly. 

4-5. Bayesian method 

When we obtain the fitting function, we use the staggered chiral perturbation theory 
to expand it in powers of p 2 , a 2 , and m q . In the series expansion, we must truncate 
the higher order terms because we cannot include an arbitrary number of terms in the 
fitting function. One constraint is that we have only 4 data points of Bk for the SU(2) 
analysis. Hence, the fitting function can have at most 3 unknown parameters, if we want 
to perform the normal least \ 2 fitting based on the multivariate statistics theory. This 
means that we can include all the next to leading order (NLO) terms and one additional 



5 Here, the quality of fitting means that we can compare two different fitting procedures and determine 
which fitting is more reliable based on the Bayesian method. For example, we can compare the full 
covariance fitting and the ES method for Bji, since both of these methods allows for the probability 
interpretation. 
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i 


1 


2 


3 


4 


bi 


1.020(5) 


0.5659(13) 


0.1060(3) 


0.01442(3) 


a>i 


1.021(4) 


0.5655(14) 


0.1061(3) 


0.01442(3) 


10 5 • S, 


110(41) 


36.5(136) 


5.00(187) 


0.147(55) 


ft 


0.7583(16) 


0.2334(16) 


0.008194(69) 


0.0001515(12) 



Table 10: Eigenmode decomposition of /5 for the fitting with the Bayesian method. 



term at the next to next to leading order (NNLO). It is the fitting function of Eq. (Q} that 
we obtain following this premature logical path. This looks fine as long as the truncated 
terms at the higher order are under control such that the full covariance fitting works 
well. However, in our case of the SU(2) analysis on Bk, the full covariance fitting fails 
manifestly because the data have such a high precision that the truncated terms of the 
higher order are required to fit the data. We cannot add higher order terms to the fitting 
function in a normal sense of the multivariate statistical theory. Hence, the situation is 
checkmate as it is. T he quest ion is how we can get out of this trap. A natural answer is 
the Bayesian method 



In the Bayesian method, the prior condition behaves in the fitting as if it is one of 
the data points as explained in the previous subsection. Hence, it is possible to add n 
higher order terms as long as we impose m prior conditions on the fitting with n < m. 
In practice, we impose the same number of prior conditions as that of the higher order 
terms added to the fitting function as follows. The fitting function has three additional 
terms at higher order: 

f*(X) = f th (X) + c b 4 X 2 (lnX) 2 + 4x 2 (lnX) + c b 6 X 3 . (45) 

We impose the prior conditions on the fitting through the prior probability as in the 
previous subsection: 

6 (c b — a,k b) 2 

Xprior ^ t _2 " (46) 



k=4 



'k:B 



Since we know that c| = ± 1, we may choose ak,B = and (7k- b = 1. In the Bayesian 
method, we use xtug instead of x 2 , m the analysis, which is defined as 

L = \og(P(A\D,I)) (47) 
Xaug = (-2)L = X 2 +Xp ri or (48) 

where P(A\D,I) is the posterior pdf |24|. The Bayesian principle is that we determine 
the fitting parameters such that they maximize the posterior pdf or minimize Xaug- The 
main advantage of the Bayesian method is that it allows for probability interpretation 
and model selection as explained in Rcf. 24]. 

In Figure [6j we show the fitting results obtained using the Bayesian method. The 
fitting has x aug = 1-09(81). The effective number of the data points is D = 4 + 3 = 7 and 
the number of the unknown parameters is P — 6. Hence, the effective number of degrees 
of freedom is d = D — P = 1 = d, the same as the full covariance fitting. In Table ITOl we 
decompose the fitting function fR obtained using the Bayesian method in terms of the 
eigenmodes of the full covariance matrix. The fitting looks fine to our eyes. 
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Figure 6: %(l/n) vs. Xp on the C3 ensemble. All the parameters are the same as in Figure [2] Here, 
the red line represents the results of the Bayesian method. 

A major advantage of the Bayesian method is that it does NOT touch the covariance 
matrix at all unlike the cutoff method and the diagonal approximation. 

In the Bayesian method, we need to gauge the sensitivity of the fitting to the prior 
condition. In our case, we change the prior condition as follows: 



The mean value of Bk is shifted by 0.28a while the error bar increases by 9%. Hence, the 
difference between the ES method and the Bayesian method is 0.18c, which is smaller 
than the systematic error due to the ambiguity in the prior condition. Since both the 
ES and Bayesian methods are based on the Bayes theorem, they are equivalent to each 
other in that sense. When we obtain the higher order terms in Eq. (l45|) . we use the 
continuum chiral perturbation theory but not the staggered chiral perturbation theory. 
Hence, the functional form of each higher order term is approximate and not exact. 
From this standpoint, we cannot claim that the Bayesian method is better than the ES 
method. Therefore, we decide to quote the difference between the results of ES and 
Bayesian methods as our systematic error due to the ambiguity in the covariance fitting 
in the SW-2, and to choose the results of the Bayesian method as the central value. 



Ok-B = 1 -> 2 



(49) 



while we keep ak-B unchanged. The fitting results are changed as follows. 




0.5757(53) 0.5772(58) 
1.09(81) -> 0.31(23) 



(50) 
(51) 
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Figure 7: Histogram of the A4 eigenvalue. The histogram has been obtained using the bootstrap method. 
The number of the bootstrap Monte Carlo samples is 10,000. The normalization of the histogram is 
adjusted such that the total probability is one. The blue curve represents the T distribution. The details 
on the r distribution function are explained in Appendix | Appendix D] 



i 


scale 


Xi 


o"i(jk) 


crf(bs) 


af(bs) 


1 


1CT 6 


19.51 


1.13(10) 


1.08(9) 


1.16(11) 


2 


10- 7 


19.24 


1.08(7) 


1.05(7) 


1.11(8) 


3 


1Q -9 


75.79 


4.35(35) 


4.18(33) 


4.47(37) 


4 


10- 11 


110.9 


5.97(39) 


5.79(38) 


6.11(42) 



Table 11: Error analysis of eigenvalues. The scale represents the overall multiplication factor. <t; is 
the error of the Ai. The "jk" and "bs" index represent the jackknife method and the bootstrap method, 
respectively, of and erf' represents the left error and right error. 



5. Error analysis of the covariance matrix 

It is true that elements of the covariance matrix may, in general, have significantly 
larger errors than those of the small eigenvalues. However, what makes the significant 
difference in fitting is the small eigenvalues and the corresponding eigenmodes. Hence, 
we focus on the error analysis of the small eigenvalues. 

The probability distribution of the small eigenvalues is the gamma (r) distribution, 
which is proved in Appendix |Appendix D| Since the T distribution is different from the 
normal distribution, it is important to check whether our data respects the T distribution 
or not. In Figure UJ we show the probability density of the A4 eigenvalue in the format 
of histogram. The blue curve in the plot corresponds to the T distribution function. We 
find that the data respects the T distribution very well. 

In Table HT1 we present the error of the eigenvalues. The jackknife method is not very 
sensitive to the left-right asymmetry of the distribution and provides a rough estimate of 
the errors. However, the bootstrap method is by nature sensitive to the asymmetry. In 
the table, we show the jackknife error as well as the left & right errors of the bootstrap 
method. As one can see, the left errors are consistently smaller than the right errors, 
which indicates that the probability follows the T distribution. 

20 



6. Conclusion 

Here, we address an issue of covariance fitting on the highly correlated data: a general 
question frequently asked in the lattice QCD community. As an example, we have chosen 
the 4X3Y-NNLO fit of the Bk data based on the SU(2) staggered chiral perturbation 
theory explained in SW-1. It turns out that the smallest eigenvalue of the covariance 
matrix leads to a extremely poor fitting. If there exist very tiny eigenvalues in the full 
covariance matrix, the small discrepancy between the fitting function and the data can 
be dramatically amplified to make such a trouble that the fitting fails in passing through 
the data points within the statistical uncertainty. In order to get around the trouble, the 
lattice community have been applying the diagonal approximation, the cutoff method, the 
modified covariance method to the data analysis. All of these poor prescriptions modify 
the covariance matrix in one way or another and so lose the physical interpretation of x 
completely. Hence, we have been searching for a possible method which does not touch 
the covariance matrix and allow for the physical interpretation of the x 2 - A natural 
prescription which satisfies our requirements turns out to be the Bayesian method and 
its variations. 

In this paper, we suggest a new proposal: the eigenmode shift (ES) method. In this 
method, we shift the fitting function by a negligibly tiny amount, which allows the full 
covariance fitting. Note that we do not need to modify the covariance matrix at all in 
the ES method, and, in addition, x&ug nas a physical meaning based on the Bayesian 
probability interpretation. 

Another good approach is the Bayesian method in which we add as many higher 
order terms to the fitting function as the prior conditions such that the fitting works 
well (i.e. the Xa Ug has a reasonable value and the fitting looks fine to our eyes). One 
ambiguity in this method is our choice of higher order terms. In our example of Bk, 
we use the continuum chiral perturbation theory to obtain the functional form of each 
higher order term. Since it is a kind of approximation and not exact, we cannot claim 
that the Bayesian method is better than the ES method. 

Our final suggestion is that it might be a good choice if one can try both the ES 
and Bayesian methods in the data analysis and quote the difference as the systematic 
uncertainty due to an ambiguity in the covariance fitting. We apply this approach to the 
error analysis in SW-2. 

In order to help readers to digest the main points of this paper, in Appendix |Appcndix C| 
we provide a pedagogical example of data analysis, in which the diagonal approximation 
and the cut-off method fail in fitting manifestly, but the ES method and the Bayesian 
method work very well. This exemplifies an odd truth that the conventional wisdom in 
the diagonal approximation and the cut-off method might be falling apart in some cases. 
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Appendix A. Equivalence of cutoff method and unconstrained ES method 

Unconstrained ES method is the ES method whose shifting parameter, r], is not 
constrained by the Bayesian prior. It is the same as the ES method whose prior condition 
is set to a n = oo. In this section, we would like to prove that the unconstrained ES 
method is equivalent to the cutoff method. 

The x 2 of the cutoff method can be written in the following form: 

x l ut = (y-f\C'^\y-f). (A.l) 
Here, / is a vector representing the fitting function value, 

fi = /thpQ) , 

y is the £>-dimensional vector of average data points and C'~ l is the inverse covariance 
matrix of the cutoff method. If R number of eigenmodes, denoted byS+l<i<D with 
S = D — i?, are removed from the covariance matrix by the cutoff method, C'^ 1 can be 
written as follows, 



c 



c- 1 - E h Vi ^ Vi 



i=S+l 



E><* 

D 



(A.2) 



k=l 

Using the eigenmode decomposition given in Eq. (|A.2[) . we can rewrite Xcut as 



xL = E^-Ky-/k>] 2 , (A.3) 

l — l 



while x 2 is defined by 



X 2 = (y-f\C~ L \y-f) 

= Eir[<y-/k>] 2 - ( A - 4 ) 

2—1 

Here, C _1 is the inverse of the full covariance matrix. 

The x 2 °f the unconstrained ES method can be written in the following form: 

x 2 j ES = {y-f'\c- 1 \y-f), (A.5) 
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where /' is defined by 



D 



f = f th (X) + 

i=S+l 



The Xues can ^ e expanded as follows 

D 

A 



Xues 



X 



2 £ YMv-f\ v *)+ E h* 



i=S+l 
D 



1=5+1 
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X 



E T-K^-/k)] 2 + E i- Vi-{y-f\vi) 



i=S+l 
D 



Xcut 



E Y. m ~( y ~ ^ 



i=S+l 
2 



i=S+l 



Minimizing Xijes yields 



m = (y- f\vi) , 



(A.6) 



(A.7) 



(A.8) 



and it removes the last term in Eq. (|A.7|) . Hence, the minimizing Xcut gives the same 
results as those of the minimizing Xues- This proves our claim. 



Appendix B. Probability distribution of the minimized x 2 

Appendix B.l. Distribution of x 2 for the full covariance fitting 

As an introduction, we derive the probability distribution of the minimized value of 
X 2 for the full covariance fitting. Here, we assume that we have large number of data 
samples so that the sample covariance matrix is well determined. Generally, the fitting 
function is inaccurate and the inaccuracy leads us to the non-central x 2_ distribution. 
The non-central x 2 -distribution is defined as follows. 

Let us consider a set of independent random variables, 

{Xi,X 2} X 3 , • • • , Xk} ■ 

The random variables, Xi, follow the normal distribution with mean and variance 1, 
denoted by Xi ~ J\f(pi, 1). If we define a new random variable Q as follows, 



(B.l) 



then Q is distributed according to the non-central x 2 -distribution with degrees of freedom 
k and with non-centrality parameter 



K = E< 



(B.2) 



i=l 
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The x 2 °f the full covariance fitting can be written in the following form: 

x 2 = (y - f\C- x \y - f) . (B.3) 

Here, / is a vector representing the fitting function value, 

fi = fth{Xi) , 

y is the D-dimensional vector of average data points and C _1 is the inverse of the 
covariance matrix. The expectation value and covariance of the vector y — f are 

£[y -/] = </» (B.4) 

C[y-f]=C[y]=C. (B.5) 

Here, the vector (f> represents the error of the fitting function. If the fitting function is 
exact, = 0. According to the multi-dimensional central limit theorem, the average 
of data, y, is distributed as multivariate normal distribution. Hence, (y — /) follows 
the multivariate normal distribution with mean vector <fi and the covariance matrix C, 
(y-/)~A^,C). 

Let M be a non-singular square matrix satisfying 

MCM T = I , (B.6) 

where / is an identity matrix. If we define a new vector Y by 

Y = M(y-f), (B.7) 

then the expectation value and covariance of Y is 

£[Y] = M<j) (B.8) 
C[Y] = MCM T = I. (B.9) 

Since M is non-singular, the transformed vector Y is also distributed according to the 
multivariate normal distribution (Proof is given in Ref. 10]). Hence, Y is distributed 
according to the multivariate normal distribution with mean Mip and covariance matrix 
/. Note that identity covariance matrix implies mutual independence of Yi. 
In terms of Y, the x 2 , given in Eq. (|B.3[) . can be rewritten by 

X 2 = (Y\Y) 

D 

= E^ 2 ' (B.10) 

i=l 

Because Yi are independently distributed as normal distribution with mean [M<j)]i and 
variance 1, % 2 is distributed according to the non-central % 2 -distribution. If the number 
of fitting parameters is P, degrees of freedom of the distribution is D — P. The non- 
centrality parameter of the distribution is 

D 2 



\k 

fc=l 

(M0) T (Af0) = c/) T M T M(j> = (fFC- 1 ^, (B.ll) 



where we used M T M = C 1 at the last equality. 
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Appendix B.2. Distribution of \ 2 for the cutoff method 

As described in Appendix |Appcndix A| the x 2 of the cutoff method can be written 
in the following form: 

xlux = (y-f\c'- x \y-f), (B.i2) 

where C"" 1 is the inverse covariance matrix of the cutoff method, which can be written 
by 

C'- 1 = C- 1 - J2 hvi)(vi\ (B.13) 

i=S+l 1 



c- 1 = EfkX 



k=l K 



Here, R is the number of removed eigenmodes, and S = D — R. Let M be a non-singular 
square matrix satisfying 

MCM T = I, (B.14) 



where I is an identity matrix. Using the eigenmode decomposition, it is 

D 

~ A,. 



M = V^=|e fc )(t> fc | (B.15) 
fe =i ^ Afe 

D 

M" 1 = X)V^fck)<e*|, (B.16) 



fe=i 

where e& is a unit vector in fc-direction. In terms of the new vector Y = M{y — /), the 
Xcut becomes 

X 2 cut = (Y\{M T )- i C- 1 M- 1 \Y) 

= E^- E (B-17) 

fe=l i=S+l 

s 



E^ 2 - (B.18) 



fe=i 

Here, we used the following relation, 



{M T )- 1 C- 1 M- 1 = I- J2 \ e ^i\- (B.19) 



i=S+l 



As mentioned in Appendix | Appendix B.I[ Yi are independently distributed as normal 
distribution with mean [M<fi]i and variance 1, Yi ~ J\r([M<f>]i,i). Hence, Eq. (lB~T7l) 
shows that Xcut is distributed according to the non-central % 2 -distribution with degrees 
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of freedom D — P — R and with non-centrality parameter 



d 2 D 

k = ]T(J A/ *) - E (j M #< 

k=l i=S+l 



S * - n 2 



v k \4>) , (B.20) 



where (j>, which is a vector representing the error of fitting function, is defined in Eq. (|B.4I) . 
Here, D is the number of data points, P is the number of fitting parameters and R is the 
number of removed eigenmodes. If we assume that the fitting function is exact (0 = 0), 
Xcut follows the ^-distribution with degrees of freedom: S — P = D — P — R. 

Appendix B. 3. Distribution of \ 2 for the ES method 

In Appendix |Appcndix A| we show that unconstrained ES method is the same as 
the cutoff method. The probability distribution of the minimized value of x 2 m cutoff 
method is derived in Appendix |Appcndix B.2 and it is the ^-distribution with degrees 



of freedom D — P — R, if we assume that the fitting function is good enough. Hence, the 
probability distribution of the minimized value of \ 2 i n unconstrained ES method is the 
X 2 -distribution with degrees of freedom D — P — R. Here, R is the number of shifting 
eigenvectors that modify the fitting function. 

The deformation of the degrees of freedom in unconstrained ES method can be consid- 
ered as a consequence of adding new fitting parameters that control the shifting eigenvec- 
tors. We add R number of shifting parameters, {rji} with i = 1, 2, . . . , R, and it increases 
the number of fitting parameters to P + R. As a result, the degrees of freedom becomes 
D- P-R. 

In the normal ES method, we constrain the shifting parameters by Bayesian method. 
Hence, distribution of the augmented \ 2 m method can be interpreted that of in the 
Bayesian constrained fitting. 

By the Bayes theorem, 

P(A\B, I) cx P(B\A, I) x P{A\I) (B.21) 
Here, P(H)\A,I) is the likelihood function, which can be expressed as 

P(D\A, I) cx exp f-y) ( B - 22 ) 
In addition, let us assume that the prior probability P(A\I) can be written as 

P{ A\I)ocexp^-^f) (B.23) 

The effective number of degrees of freedom in the ES method is d = D — P = D — P = d, 
where D = D + R and P = P + R since there are R number of the prior conditions 
imposed on rji and there exists as much increase in the number of fitting parameters by 
R. Hence, the Xaug = X 2 + Xprior must follow the normal x 2 distribution with x^ug = 

d ± \pld — d ± \/2d. Therefore, the Xaug °f the ES method follows the same normal \ 2 
distribution as that of the full covariance fitting. 
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Appendix C. An example of fitting with random data 

In this section, we will show an example to see how the fitting methods, given in 
section 21 work. 

First, let us explain how the data has been generated. The true mean ^ of the data 

is 

Mi = /true(Xj) = (C.l) 

l—Xi 

a - 1) 

We choose 7 data points (D = 7) such that x l = 0.2 x ^ - with i = 1, 2, 3, . . . , 7. We 

6 

also choose the eigenvalues of the true covariance matrix as follows, 

A[ = lcH^ (C.2) 

for i = 1, 2, 3, . . . , 7. Then, we generate the random numbers to construct the eigenvectors 
which are orthonormal to each other by construction. Hence, we can obtain the true 
covariance matrix r,-j from the eigenvalues and eigenvectors. Note that the covariance 
matrix for the sample mean dj is related to the as follows: 

C---T- 
~~ N 

Hence, the eigenvalues of Cy is smaller by a factor of N as follows. 

For the notational convenience, we drop out the superscipt for the eigenvalues in the 
discussion in this section. 

Next, we generate the data yi following the probability distribution: Y ~ Af(fi,T). 
Here, we use the numerical algorithm introduced in Ref. [26[ to generate the data of yi 
according to the pdf: 

p( y \fi, r) = ZeM-\(y - m) • r- 1 • (y - /*)] (c.3) 



i 



(27r) c det(r) 1 /2 



(C.4) 



In this way, we collect 1000 random data samples for yi (i.e. N — 1000). 
We know that the true fitting function can be rewritten as follows, 

/truc(» = r— — = 1 + x + x 2 + ■ ■ ■ + x n + ■■■ (C.5) 
1 — X 

We choose the trial fitting function as follows: 

/triai(a;) = ci + c 2 x + c 3 x 2 (C.6) 
In addition, we also choose the trial fitting function for the Bayesian method as 

/Baycs(a;) = C\ + C 2 X + C 3 X 2 + C^X 3 , (C.7) 
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Figure C.8: Comparison of various fitting methods: full covariance fitting (full cov), diagonal approx- 
imation (diag), cut-off method (cut-off), modified covariance matrix method (mod). Bayesian method 
(Bayes) and eigenmode shift method (ES). 

where we will impose the prior condition on c|. 

We fit the data using the fitting methods such as the full covariance fitting, diagonal 
approximation, cut-off method, modified covariance method, ES method, and Bayesian 
method. In the cut-off method, we remove lowest two eigenmodes by imposing the cut-off 
(Aj/Ai > 1 .0 x 10~ 3 ), where Ai is the largest eigenvalue. In the ES method, we introduce 
two shift parameters r\\ and r\i in the direction of the two smallest eigenmodes. We 
impose the following prior condition on r\i. 

m ~ M(a u af) (C.8) 

ai — 

cr, = 0.001 

Here, the highest terms of the truncated are of order 0(x 3 ). By assuming that the 
coefficient is 0(1), we can estimate the size of truncated terms approximately as 

at w 1.0 x a; 3 w 1.0 x (0.1) 3 = 0.001 , 

where we take the average of the x value. 

In the Bayesian method, we impose the following prior condition on c\. 

c\ ~ 7VK<7l) (C.9) 
a4 = 

(74 = 1 

In Figure [C .81 we show the fitting results obtained using various fitting methods. As 
one can see in the plot, it is very clear that there is something seriously wrong with the 
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fit type 


Cl 


C2 


C3 


xVdof 


full cov 


0.9998(19) 


0.957(41) 


1.51(24) 


0.65(81) 


diag 


1.0055(105) 


0.818(211) 


2.13(97) 


0.45(96) 


cut-off 


1.0743(808) 


-1.67(285) 


12.8(122) 


0.70(118) 


mod 


0.9989(20) 


0.981(47) 


1.38(27) 


1.9(26) 


Bayes 


0.9999(19) 


0.950(42) 


1.59(27) 


0.63(79) 


ES 


0.9998(19) 


0.954(42) 


1.53(24) 


0.60(76) 



Table C.12: Fitting results of various fitting methods: The fitting details are explained in the text. 

cut-off method. In Table [CT21 we present results for various fitting methods. The results 
of the full covariance fitting is what we want to achieve in the fitting business ultimately. 
As one can see in the table, the diagonal approximation and the cut-off method fail, 
manifestly. In other words, errors of the Cj parameters are dramatically larger than those 
of the full covariance fitting, which indicates a symptom of a complete failure in fitting. 
In addition, we notice that both the ES method and the Bayesian method work very well 
in good agreement with the expectation, while the ES method looks marginally better 
in this example. In the case of the ES method, we obtain the following results for the 
shift parameters: 

rji = 0.00015 ±0.00154 
772 = -0.00023 ±0.00039, 

which are highly self-consistent with the prior condition. 
In this example, we learn the following lessons: 

• In this example, the full covariance fitting works very well. Hence, we know the 
fitting results a priori. In addition, we know the true answer to the fitting. 

• In this example, the diagonal approximation overestimates the error of the C\ pa- 
rameter by factor of 5. To make matters worse, the cut-off method overestimates 
the same quantity by factor of 40. This indicates that we should be very careful 
when we use these methods. They are very likely to overestimate the errors, which 
could lead to a wrong answer as in this example. Hence, we do not recommend 
both the diagonal approximation and the cut-off method for the data analysis to 
aim at the high precision. 

• In this example, the ES method and the Bayesian method work very well in good 
agreement with the full covariance fitting, while the ES method is marginally better. 
Hence, we decide using these two methods to do the data analysis for Bk as quoted 
in SW-2. 

Appendix D. Statistical analysis of the eigenvalues 

Appendix D.l. Distribution of the eigenvalues 

In this section, we will talk about the distribution of the eigenvalues of our covariance 
matrix. Let us consider a set of independent and identically distributed random variables 
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{Xi,X2, ■ ■ ■ ,Xn}. A sample variance can be defined by 



1 N _ 

— Y^{X % ~Xf, (D.l) 



where X is an average of Xi over i. If are distributed according to the normal- 
distribution with mean \i and variance cr 2 , then (N — l)s 2 /cr 2 is distributed according to 
the ^-distribution with degrees of freedom TV — 1 : 

(""l^E^^-xSr-i. (D.2) 

i—l 

Since we have N — 671 gauge configurations averaged from 9 measurements (see Table 
[I]), we can assume that the data is distributed according to the normal-distribution. 
Hence, the sample variance multiplied by a constant of our data is distributed according 
to the x 2 -distribution with degrees of freedom N — 1. 

If a random variable Y is distributed according to the x 2 -distribution with degrees of 
freedom v, then a new random variable multiplied by a positive constant c is distributed 
according to the gamma-distribution (T-distribution) with shape parameter k = v/2 and 
scale parameter 9 = 2c: 

Y ~ xl, 

cY ~ r(fc=^,0 = 2c). (D.3) 

Hence, sample variance of normally distributed random samples is distributed according 
to the T-distribution. 

There is an orthogonal transformation of the basis vectors that makes the covari- 
ance matrix diagonal with its diagonal elements equal to the eigenvalues. Note that a 
linear combination of normal random variables is also normally distributed as proved 
in Ref. [loj . We may regard the eigenvalues as a variance of normally distributed ran- 
dom variables (data). Therefore, the eigenvalues of the covariance matrix are distributed 
according to the T(^^-,9i) distribution. If we assume that the expectation values of 
eigenvalues are the measured eigenvalues, Xi, it gives 

where we used a property of r-distribution that a mean value of T(fc, 9) is k9 1(1 27 1. 

The probability distribution function (pdf) of T-distribution with shape parameter k 
and scale parameter 9 (denoted T(k,9)) is 

p(x;k,9)=x k - 1 -^^- ) forx>0. (D.5) 

Here, k and 9 should be positive and T(fc) is a gamma function. The mean and variance 
are k9 and k9 2 , respectively. 
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i 


scale 


Aj 




CTi(GD) 


a,(EP) 


CTi(FM) 


1 


l(T y 


19.51 


1.13 


1.07 


1.12 


1.07 


2 


io- 7 


19.24 


1.08 


1.05 


1.07 


1.05 


3 


io- 9 


75.79 


4.35 


4.14 


4.33 


4.15 


4 


IO" 11 


110.9 


5.97 


6.06 


5.92 


6.06 



Table D.13: Error of eigenvalues estimated by different methods. The scale represents the overall 
multiplication factor. Oi is the error of the A;. The "JK", "GD", "EP" and "FM" index represent the 
jackknife method, gamma-distribution method, error propagation method and error propagation with 
fourth moment method, respectively. Each <j{ element has about 8% error. 



Appendix D.2. Error of eigenvalues 

In table II 1 1 we presented the error of eigenvalues measured by resampling method : 
the jackknife method and the bootstrap method. In this section, we will describe other 
possible methods that estimate errors of eigenvalues. 

The first method is to use properties of the T-distribution of eigenvalues. Using 
results of Appendix | Appendix D.l[ we know that the distribution of eigenvalues are 
T-distribution with shape parameter k — jv ~ 1 and scale parameter 6i — N 2 _ 1 K- Since 

the variance of the distribution is k&f , the statistical error of the eigenvalues are \J k6f. 

The second method is using the error propagation formula. The error of eigenvalues 
basically originates from the error of the covariance matrix. Hence, we can measure the 
fluctuation of the eigenvalues fluctuating covariance matrix. In other words, we obtain 
eigenvalues of new covariance matrices fluctuated by 

Sij = Cij + <%, (D.6) 

where Sij is a Gaussian random noise. Note that the elements of the covariance matrix 
are correlated, which means that the random noise should be generated containing the 
information of the correlation. Hence, we generated the random noise following multi- 
variate normal distribution with zero mean and given covariance, Cov(Cy, C k i). 

Let us calculate the covariance matrix of the covariance matrix, Cov(C7y, G k i)- The 
covariance matrix defined in Eq. ([5]) can be rearranged as follows: 

n 

Xij{n) = j^—j[yi(n) - yi][yj{n) - (D.8) 
Hence, the covariance matrix of the covariance matrix can be defined as follows, 
Cov(C tj ,C k i) = _ ^ ^[xij(rt) - Xij][xM(n) - m] 



i— 5>,» - C l3 ][x kl {n) - Cm]- (D.9) 



N(N — 1) 

The third method is using the fourth moment of the normal distribution. It is well 
known that the 4th moment of normally distributed random variables, Zi, is 

£[(Zi - fXi)(Zj - fij){Z k - Hk){Zi - ///)] = o- V] o- k i + o ik Oji + o-uo-jk, (D.10) 
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where jii are the true expectation values of Zi and are the true covariance between 
Zi and Zj (See Ref. 10]). Let US cLSSUlTlG thclt Cij can be counted as a true covariance of 
mean of two random variables in the limit of N — > oo. Then, CVj is related to cry in the 
limit of N — >• oo: 

r -Irr 

if can be counted as a true covariance of mean of two random 

variables, Cov(Cy, G\./) can be estimated by 



Cov(C«,C7 H ) 



AT 



(Z fc - fi k )(Zi - 
iV 



(CikCji + CiiCjk), 



-e[a 



(D.ll) 



where we used Eq. (jP.101) to obtain the last equality. 

The results of error analysis on the eigenvalues are summarized in Table ID. 131 All 
the methods show the same results within statistical uncertainty. 
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